
fix(docx-core): order-constrained TF-IDF paragraph matching prevents phantom redlines #62

Merged
stevenobiajulu merged 1 commit into main from fix/paragraph-similarity-alignment
Mar 27, 2026


@stevenobiajulu
Member

Summary

Fixes #61

  • Replace greedy first-match Jaccard similarity fallback with order-constrained gap matching + TF-IDF cosine similarity
  • Pass 1 exact-hash anchors divide documents into gaps; Pass 2 runs mini-LCS within each gap, guaranteeing document order preservation
  • TF-IDF down-weights legal boilerplate terms ("holders", "Preferred Stock") that inflated Jaccard scores and produced false matches
  • Add NVCA COI regression test (234 vs 175 paragraphs, 94 footnote refs removed)

Before: Greedy first-match allowed Source[45] (dividends) to steal Revised[20] (liquidation) via shared boilerplate → garbled reject-all → rebuild fallback → 949 phantom insertions

After: Gap constraints prevent cross-anchor matches; Source[50] correctly matches Revised[20] → inplace succeeds → 332 insertions, clean accept/reject round-trip
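
The gap constraint can be illustrated with a minimal sketch. This is not the code in `hierarchicalLcs.ts`: every name here (`gapsBetweenAnchors`, `alignGap`, `matchWithinGaps`, `wordOverlap`) is hypothetical, the anchors are passed in rather than computed from hashes, "mini-LCS" is interpreted as a similarity-weighted LCS, the 0.4 threshold is arbitrary, and a plain word-overlap score stands in for the real similarity function.

```typescript
type Pair = [number, number]; // [sourceIndex, revisedIndex]
type Similarity = (a: string, b: string) => number;
interface Gap { srcStart: number; srcEnd: number; revStart: number; revEnd: number } // ends exclusive

// Split both documents into the regions lying between consecutive anchors.
// Assumes anchors are strictly increasing in both indices.
function gapsBetweenAnchors(anchors: Pair[], srcLen: number, revLen: number): Gap[] {
  const gaps: Gap[] = [];
  let prevS = -1;
  let prevR = -1;
  const sentinel: Pair = [srcLen, revLen];
  for (const [s, r] of [...anchors, sentinel]) {
    // Only regions with unmatched paragraphs on both sides can yield pairs.
    if (s - prevS > 1 && r - prevR > 1) {
      gaps.push({ srcStart: prevS + 1, srcEnd: s, revStart: prevR + 1, revEnd: r });
    }
    prevS = s;
    prevR = r;
  }
  return gaps;
}

// Similarity-weighted LCS inside one gap: pairs never cross, so order is preserved.
// Assumes threshold > 0.
function alignGap(src: string[], rev: string[], sim: Similarity, threshold: number): Pair[] {
  const n = src.length;
  const m = rev.length;
  const s = src.map((a) => rev.map((b) => sim(a, b)));
  const best: number[][] = [];
  for (let i = 0; i <= n; i++) best.push(new Array<number>(m + 1).fill(0));
  for (let i = n - 1; i >= 0; i--) {
    for (let j = m - 1; j >= 0; j--) {
      const take = s[i][j] >= threshold ? s[i][j] + best[i + 1][j + 1] : 0;
      best[i][j] = Math.max(take, best[i + 1][j], best[i][j + 1]);
    }
  }
  // Walk the table forward to recover the chosen pairs.
  const pairs: Pair[] = [];
  let i = 0;
  let j = 0;
  while (i < n && j < m) {
    if (s[i][j] >= threshold && best[i][j] === s[i][j] + best[i + 1][j + 1]) {
      pairs.push([i, j]);
      i++;
      j++;
    } else if (best[i + 1][j] >= best[i][j + 1]) {
      i++;
    } else {
      j++;
    }
  }
  return pairs;
}

function matchWithinGaps(
  source: string[], revised: string[], anchors: Pair[], sim: Similarity, threshold = 0.4,
): Pair[] {
  const pairs: Pair[] = [...anchors];
  for (const g of gapsBetweenAnchors(anchors, source.length, revised.length)) {
    const local = alignGap(
      source.slice(g.srcStart, g.srcEnd), revised.slice(g.revStart, g.revEnd), sim, threshold,
    );
    for (const [i, j] of local) pairs.push([g.srcStart + i, g.revStart + j]);
  }
  return pairs.sort((a, b) => a[0] - b[0]);
}

// Stand-in similarity for the demo (assumes no repeated words in a paragraph).
const wordOverlap: Similarity = (a, b) => {
  const wa = a.split(" ");
  const wb = b.split(" ");
  const shared = wa.filter((w) => wb.indexOf(w) >= 0).length;
  return shared / (wa.length + wb.length - shared);
};

const source = ["Article I", "dividends accrue at eight percent", "Article II", "liquidation proceeds go first"];
const revised = ["Article I", "dividends accrue at six percent", "Article II", "dividends accrue at eight percent annually"];
const anchors: Pair[] = [[0, 0], [2, 2]];
const matched = matchWithinGaps(source, revised, anchors, wordOverlap);
```

In this toy input, `source[1]` is most similar to `revised[3]`, but the two sit on opposite sides of the `Article II` anchor, so that pair is never scored; `source[1]` is paired with `revised[1]` inside its own gap instead.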

Test plan

  • NVCA COI regression: inplace mode, no fallback, insertions < 500, accept/reject text parity
  • Full docx-core suite: 1053 passed, 1 skipped
  • Full docx-mcp suite: 636 passed
  • Manual Word review: no phantom changes, clean Accept All / Reject All
  • TypeScript build clean (no errors)

…phantom redlines

Replace greedy Jaccard similarity fallback with two improvements:

1. Order-constrained gap matching: Pass 1 exact-hash anchors divide documents
   into gaps. Pass 2 similarity matching is scoped to each gap via mini-LCS,
   guaranteeing document order preservation.

2. TF-IDF cosine similarity: Replaces Jaccard, which over-weights common legal
   boilerplate words ("holders", "Preferred Stock", "Corporation"). IDF
   down-weights high-frequency terms; cosine similarity on TF-IDF vectors
   produces more discriminating scores.
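
A minimal sketch of the scoring difference, not the merged implementation. The helper names (`tokenize`, `buildIdf`, `tfidfVector`, `cosine`, `jaccard`) are hypothetical, the IDF is assumed to be the unsmoothed `log(N / df)` computed over all paragraphs being compared, and the three sample paragraphs are invented for illustration.

```typescript
type Vec = Map<string, number>;

const tokenize = (text: string): string[] => text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// Assumed IDF: log(N / df). A term present in every paragraph gets weight 0.
function buildIdf(corpus: string[][]): Map<string, number> {
  const df = new Map<string, number>();
  for (const doc of corpus) {
    new Set(doc).forEach((t) => df.set(t, (df.get(t) ?? 0) + 1));
  }
  const idf = new Map<string, number>();
  df.forEach((count, term) => idf.set(term, Math.log(corpus.length / count)));
  return idf;
}

// Term frequency times IDF, accumulated one occurrence at a time.
function tfidfVector(tokens: string[], idf: Map<string, number>): Vec {
  const vec: Vec = new Map();
  for (const t of tokens) vec.set(t, (vec.get(t) ?? 0) + (idf.get(t) ?? 0));
  return vec;
}

function cosine(a: Vec, b: Vec): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((w, t) => {
    normA += w * w;
    dot += w * (b.get(t) ?? 0);
  });
  b.forEach((w) => {
    normB += w * w;
  });
  // A paragraph made only of zero-weight terms has no direction; score it 0.
  return normA > 0 && normB > 0 ? dot / Math.sqrt(normA * normB) : 0;
}

function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  let shared = 0;
  setA.forEach((t) => {
    if (setB.has(t)) shared++;
  });
  const union = setA.size + setB.size - shared;
  return union > 0 ? shared / union : 0;
}

const paragraphs = [
  "Dividends accrue to holders of Preferred Stock",          // source, dividends
  "Liquidation proceeds go to holders of Preferred Stock",   // revised, liquidation
  "Dividends accrue annually to holders of Preferred Stock", // revised, dividends
].map(tokenize);
const idf = buildIdf(paragraphs);
const [srcDiv, revLiq, revDiv] = paragraphs.map((p) => tfidfVector(p, idf));

const jaccardFalseMatch = jaccard(paragraphs[0], paragraphs[1]); // 5 shared / 10 distinct = 0.5
const tfidfFalseMatch = cosine(srcDiv, revLiq); // 0: the only shared terms are in every paragraph
const tfidfTrueMatch = cosine(srcDiv, revDiv);  // roughly 0.46
```

The dividends and liquidation paragraphs share only "to holders of Preferred Stock", which is enough for a Jaccard score of 0.5 but contributes nothing once terms that occur in every paragraph are weighted to zero.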

Root cause: the old greedy first-match algorithm iterated unmatched source
paragraphs in order, allowing early low-similarity matches to consume revised
paragraphs intended for higher-similarity matches later in the document. On
legal boilerplate (NVCA COI), this caused incorrect paragraph alignment,
garbled reject-all output, and fallback to the rebuild reconstruction path
with ~950 phantom insertions.
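
The failure mode can be reproduced in a few lines. This is a reconstruction from the description above, not the removed code; `greedyFirstMatch`, the score matrix, and the 0.5 threshold are all hypothetical.

```typescript
type Pair = [number, number]; // [sourceIndex, revisedIndex]

// sim[s][r] is the similarity between unmatched source s and unmatched revised r.
// Each source paragraph, in document order, claims the first free revised
// paragraph that clears the threshold, regardless of better matches later on.
function greedyFirstMatch(sim: number[][], threshold: number): Pair[] {
  const taken: boolean[] = [];
  const pairs: Pair[] = [];
  sim.forEach((row, s) => {
    const r = row.findIndex((score, idx) => !taken[idx] && score >= threshold);
    if (r >= 0) {
      taken[r] = true;
      pairs.push([s, r]);
    }
  });
  return pairs;
}

const scores = [
  [0.55, 0.1], // source 0: boilerplate-only overlap with revised 0
  [0.9, 0.2],  // source 1: the true counterpart of revised 0
];
const greedyPairs = greedyFirstMatch(scores, 0.5); // [[0, 0]]
```

Source 0 clears the threshold first and consumes revised 0, so source 1, whose score against revised 0 is far higher, is left unmatched.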

Fixes #61

@github-actions github-actions bot added the fix label Mar 27, 2026
@stevenobiajulu stevenobiajulu enabled auto-merge (squash) March 27, 2026 02:51
@stevenobiajulu stevenobiajulu merged commit 2a1b5a0 into main Mar 27, 2026
20 checks passed
@codecov

codecov bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 87.84530% with 22 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ocx-core/src/baselines/atomizer/hierarchicalLcs.ts | 87.84% | 22 Missing ⚠️ |




Development

Successfully merging this pull request may close these issues.

Greedy paragraph similarity fallback misaligns legal boilerplate, causing phantom redlines and rebuild fallback
